Document Clustering in Biomedical Literature
Abstract
As an unsupervised learning process, document clustering has been used to improve information retrieval performance by grouping similar documents and to help text mining approaches by providing high-quality input for them. In this article, the authors propose a novel hybrid clustering technique that incorporates semantic smoothing of document models into a neural network framework. Recently, it has been reported that the semantic smoothing model enhances retrieval quality in Information Retrieval (IR). Inspired by that, the authors developed and applied a context-sensitive semantic smoothing model to boost the accuracy of the clustering generated by a dynamic growing cell structure algorithm, a variant of the neural network technique. They evaluated the proposed technique on biomedical article sets from MEDLINE, the largest biomedical digital library in the world. Their experimental evaluations show that the proposed algorithm significantly improves clustering quality over traditional clustering techniques, including k-means and the self-organizing map (SOM).
DOI: 10.4018/jdwm.2009080703. This paper appears in the International Journal of Data Warehousing and Mining, Volume 5, Issue 4 (pp. 44-57, October-December 2009), edited by David Taniar. © 2009, IGI Global.
... to one another while documents from different clusters are dissimilar. Document clustering was originally studied to enhance the performance of information retrieval (IR) because similar documents tend to be relevant to the same user queries (Wang et al., 2002; Zamir & Etzioni, 1998).
Document clustering has been used to facilitate nearest-neighbor search (Buckley & Lewit, 1985), to support an interactive document browsing paradigm (Cutting et al., 1992; Gruber, 1993; Koller & Sahami, 1997), and to construct hierarchical topic structures (van Rijsbergen, 1979). Document clustering thus plays an increasingly important role in the IR and text mining communities, since the most natural form for storing information is text and the volume of text information has increased exponentially.
In the biomedical domain, document clustering technologies have been used to facilitate the practice of evidence-based medicine. This is because document clustering enhances biomedical literature searching (e.g., MEDLINE searching) in several ways, and literature searching is one of the core skills required for the practice of evidence-based medicine (Evidence-based Medicine Working Group, 1992). For example, Pratt and her colleagues (Pratt et al., 1999; Pratt & Fagan, 2000) and Lin and Demner-Fushman (2007) introduced semantic document clustering approaches that automatically cluster biomedical literature (MEDLINE) search results into document groups so that the search results are easier to understand.
Current information technologies allow us to acquire, store, archive, and retrieve documents electronically. In this setting, document clustering has received focal attention because it assists users in discovering hidden similarities and key concepts in documents. One of the most serious problems that makes clustering text difficult is that the size of text collections in digital libraries is increasing rapidly. To handle growing document collections, a clustering algorithm must not only solve the incremental problem but also be efficient on large datasets. Most document clustering algorithms require some form of data pre-processing, including stop-word removal and feature selection.
Through this pre-processing, unimportant features are eliminated and the original dimensionality is reduced to a more manageable size. However, data pre-processing has two problems. First, although it can reduce the original dimensionality, the reduced space is still sparse, a symptom of “the curse of dimensionality”. As a result, clustering results are often of low quality. Second, reducing dimensionality through pre-processing may disturb the preservation of the original topological structure of the input data.
To solve these problems, we propose a context-sensitive semantic smoothing of a document model and incorporate it into Dynamic Growing Cell Structure (DynGCS). The effect of model smoothing has not been extensively studied in the context of document clustering (Zhang et al., 2006). Most model-based clustering approaches simply use Laplacian smoothing to prevent zero probabilities (Nigam & McCallum, 1998; Zhong & Ghosh, 2005), while most similarity-based clustering approaches employ the heuristic TF*IDF scheme to discount the effect of general words (Steinbach et al., 2000). As shown in (Zhong & Ghosh, 2005), model-based clustering has several advantages over discriminative-based approaches. One advantage of model-based approaches is that they learn generative models from the documents, with each model representing one particular document set. Given the promising results reported for model-based clustering approaches, we propose a novel semantic smoothing technique to improve clustering quality.
DynGCS is an adaptive variant of the Self-Organizing Map (SOM), an artificial neural network model that is well suited for mapping high-dimensional data into a 2-dimensional representation space. The training process is based on weight vector adaptation with respect to the input vectors.
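To make the idea of weight vector adaptation concrete, the following is a minimal sketch of one training step of a standard SOM, not of the authors' DynGCS algorithm. The function names, the 2x2 grid, the Gaussian neighbourhood function, and the `learning_rate` and `radius` values are all illustrative assumptions that do not come from the paper.

```python
import math

def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x
    (squared Euclidean distance)."""
    return min(
        weights,
        key=lambda pos: sum((w - xi) ** 2 for w, xi in zip(weights[pos], x)),
    )

def som_update(weights, x, learning_rate=0.5, radius=1.0):
    """One standard SOM step: pull every unit toward the input x,
    scaled by a Gaussian of its grid distance to the best matching unit."""
    bmu = best_matching_unit(weights, x)
    for pos, w in weights.items():
        grid_dist_sq = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
        influence = math.exp(-grid_dist_sq / (2 * radius ** 2))
        weights[pos] = [
            wi + learning_rate * influence * (xi - wi) for wi, xi in zip(w, x)
        ]
    return bmu

# Hypothetical 2x2 map of 3-dimensional weight vectors (toy values).
weights = {
    (0, 0): [0.0, 0.0, 0.0],
    (0, 1): [0.0, 1.0, 0.0],
    (1, 0): [1.0, 0.0, 0.0],
    (1, 1): [1.0, 1.0, 1.0],
}
x = [0.9, 0.1, 0.0]  # toy input vector, e.g. a tiny document feature vector
bmu = som_update(weights, x)  # bmu is (1, 0), the unit nearest to x
```

In this sketch the best matching unit moves halfway toward the input, while its grid neighbours move by a smaller amount. That neighbourhood coupling is what lets a SOM preserve topological structure when projecting high-dimensional inputs onto a 2-dimensional grid.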
SOM has shown to be a highly ...
The full version of this document (12 more pages) is available from the publisher: www.igi-global.com/article/dynamic-semantically-awaretechnique-document/37404
Similar resources
Biomedical ontology improves biomedical literature clustering performance: a comparison study
Document clustering has been used for better document retrieval and text mining. In this paper, we investigate if a biomedical ontology improves biomedical literature clustering performance in terms of the effectiveness and the scalability. For this investigation, we perform a comprehensive comparison study of various document clustering approaches such as hierarchical clustering methods, Bisec...
Biomedical Ontologies and Text Mining for Biomedicine and Healthcare: A Survey
In this survey paper, we discuss biomedical ontologies and major text mining techniques applied to biomedicine and healthcare. Biomedical ontologies such as UMLS are currently being adopted in text mining approaches because they provide domain knowledge for text mining approaches. In addition, biomedical ontologies enable us to resolve many linguistic problems when text mining approaches handle...
A Coherent Biomedical Literature Clustering and Summarization Approach Through Ontology-Enriched Graphical Representations
In this paper, we introduce a coherent biomedical literature clustering and summarization approach that employs a graphical representation method for text using a biomedical ontology. The key of the approach is to construct document cluster models as semantic chunks capturing the core semantic relationships in the ontology-enriched scale-free graphical representation of documents. These documen...
A Graph-Based Biomedical Literature Clustering Approach Utilizing Term's Global and Local Importance Information
In this article, we present a graph-based knowledge representation for biomedical digital library literature clustering. An efficient clustering method is developed to identify the ontology-enriched k-highest density term subgraphs that capture the core semantic relationship information about each document cluster. The distance between each document and the k term graph clusters is calculated. ...
Mining and its Application in Biomedical Domain
Semantic Text Mining and its Application in Biomedical Domain Illhoi Yoo Xiaohua Hu, Ph.D A huge amount of biomedical knowledge and novel discoveries have been produced and collected in text databases or digital libraries, such as MEDLINE, because the most natural form to store information is text. In order to cope with this pressing text information overload, text mining is employed. However, ...
Publication date: 2016